I will be performing an exploratory data analysis (EDA) for the Vancouver Street Trees dataset located here. This dataset contains public trees on boulevards, detailing their coordinates, species, and other characteristics. It excludes park and private trees. The dataset is updated daily on weekdays, the data reflects changes based on priorities and resources, sometimes taking years for certain updates. I'm exploring this dataset because I want to gain an insight on the current generation condition and state of Vancouver trees.
import altair as alt
import pandas as pd
import os
#alt.data_transformers.enable('data_server')
Columns of Interest:
species_name: Name of the tree species.date_planted: The date the tree was planted.neighbourhood_name: The name of the neighborhood where the tree is located.latitude: The latitude coordinate of the tree.longitude: The longitude coordinate of the tree.diameter: The diameter of the tree trunk measured as DBH (diameter of tree at breast height) in inches.height_range_id: Categorical height range of the tree measured in feet. 0-10 for every 10 feet (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft). Data Quality Reivew
I'll be importing the raw data and extracting some additional information about the tables. Specifically, I want to know which data types are used and if there are any gaps (nulls) in the data. Firstly, we'll get a snapshot of how the dataset looks like and see if there are any patterns to the missing data.
vancity_trees = pd.read_csv('small_unique_vancouver__20240624182849.csv')
vancity_trees.info()
vancity_trees.head(5)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 5000 non-null int64 1 std_street 5000 non-null object 2 on_street 5000 non-null object 3 species_name 5000 non-null object 4 neighbourhood_name 5000 non-null object 5 date_planted 2363 non-null object 6 diameter 5000 non-null float64 7 street_side_name 5000 non-null object 8 genus_name 5000 non-null object 9 assigned 5000 non-null object 10 civic_number 5000 non-null int64 11 plant_area 4950 non-null object 12 curb 5000 non-null object 13 tree_id 5000 non-null int64 14 common_name 5000 non-null object 15 height_range_id 5000 non-null int64 16 on_street_block 5000 non-null int64 17 cultivar_name 2658 non-null object 18 root_barrier 5000 non-null object 19 latitude 5000 non-null float64 20 longitude 5000 non-null float64 dtypes: float64(3), int64(5), object(13) memory usage: 820.4+ KB
| Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10747 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | ... | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 | NaN | N | 49.252711 | -123.106323 |
| 1 | 12573 | W 18TH AV | W 18TH AV | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | N | ... | 7 | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | CHANTICLEER | N | 49.256350 | -123.158709 |
| 2 | 29676 | ROSS ST | ROSS ST | NIGRA | Sunset | NaN | 12.0 | ODD | PINUS | N | ... | 7 | Y | 154675 | AUSTRIAN PINE | 4 | 7800 | NaN | N | 49.213486 | -123.083254 |
| 3 | 8856 | DOMAN ST | DOMAN ST | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | N | ... | 7 | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | AUTUMN APPLAUSE | N | 49.220839 | -123.036721 |
| 4 | 21098 | EAST BOULEVARD | EAST BOULEVARD | HIPPOCASTANUM | Shaughnessy | NaN | 15.5 | ODD | AESCULUS | Y | ... | N | Y | 74364 | COMMON HORSECHESTNUT | 4 | 5200 | NaN | N | 49.238514 | -123.154958 |
5 rows × 21 columns
This observation sees 5000 data entries, 21 total columns consisting of 8 numerical columns (float64 and int64). We're seeing some columns doesn't have 5000 entries indicating there are null values. Let's observe how many null values there are.
null_values = vancity_trees.isnull().sum()
print(null_values)
Unnamed: 0 0 std_street 0 on_street 0 species_name 0 neighbourhood_name 0 date_planted 2637 diameter 0 street_side_name 0 genus_name 0 assigned 0 civic_number 0 plant_area 50 curb 0 tree_id 0 common_name 0 height_range_id 0 on_street_block 0 cultivar_name 2342 root_barrier 0 latitude 0 longitude 0 dtype: int64
To answer my questions, I will need the following columns. Since date_planted is missing half of the dates, it won't be useful for analysis. However, I will keep it for now to see if there are any patterns to the missing data.
# Note: analysis on latitude and longitude aren't useful.
trees_df = vancity_trees[[
"on_street",
"species_name",
"neighbourhood_name",
"date_planted",
"diameter",
"genus_name",
"common_name",
"height_range_id",
"root_barrier",
"latitude",
"longitude"
]]
trees_df
trees_df.describe()
| diameter | height_range_id | latitude | longitude | |
|---|---|---|---|---|
| count | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 |
| mean | 12.340888 | 2.73440 | 49.247349 | -123.107128 |
| std | 9.266600 | 1.56957 | 0.021251 | 0.049137 |
| min | 0.000000 | 0.00000 | 49.202783 | -123.220560 |
| 25% | 4.000000 | 2.00000 | 49.230152 | -123.144178 |
| 50% | 10.000000 | 2.00000 | 49.247981 | -123.105861 |
| 75% | 18.000000 | 4.00000 | 49.263275 | -123.063484 |
| max | 71.000000 | 9.00000 | 49.293930 | -123.023311 |
trees_df.describe(exclude="number", datetime_is_numeric=True)
| on_street | species_name | neighbourhood_name | date_planted | genus_name | common_name | root_barrier | |
|---|---|---|---|---|---|---|---|
| count | 5000 | 5000 | 5000 | 2363 | 5000 | 5000 | 5000 |
| unique | 607 | 171 | 22 | 1599 | 67 | 361 | 2 |
| top | CAMBIE ST | SERRULATA | Renfrew-Collingwood | 2004-02-16 | ACER | KWANZAN FLOWERING CHERRY | N |
| freq | 49 | 463 | 384 | 7 | 1218 | 383 | 4679 |
Is there a pattern to the missing data? There are 5000 data entries in this sample dataset. Columns "date_planted" and "cultivar_name" have a significant amount of missing data of about 50% recorded rate. I will now create a visualization of the missing values to help us identify possible trends.
# Disable maxrows to prevent error code
alt.data_transformers.disable_max_rows()
# Filter the DataFrame to include only the specified columns
filtered_vancity_trees = vancity_trees[['cultivar_name', 'date_planted', 'plant_area']]
# Create a DataFrame that indicates where values are null
null_df = filtered_vancity_trees.isnull().reset_index().melt(id_vars='index', var_name='column', value_name='NaN')
# Create the heatmap
heatmap = alt.Chart(null_df).mark_rect(height=17).encode(
x='index:O',
y='column',
color='NaN',
stroke='NaN').properties(
width=1200
)
heatmap
There are too missing date_planted and cultivar_name data in this dataset. These columns would not be useful for analysis. However, we can try to see if there are any patterns with these missing data.
# Convert the date_planted column to datetime format
vancity_trees['date_planted'] = pd.to_datetime(vancity_trees['date_planted'], errors='coerce')
# Extract the year and month from the date_planted column
vancity_trees['year_planted'] = vancity_trees['date_planted'].dt.year
vancity_trees['month_planted'] = vancity_trees['date_planted'].dt.month
# Filter out rows where date_planted is NaN
vancity_trees_filtered = vancity_trees.dropna(subset=['date_planted'])
# Create the scatter plot
scatter_plot = alt.Chart(vancity_trees_filtered).mark_circle(size=60).encode(
x=alt.X('year_planted:O', title='Year Planted'),
y=alt.Y('month_planted:O', title='Month Planted'),
tooltip=['date_planted', 'species_name', 'neighbourhood_name']
).properties(
width=800,
height=400,
title='Trees Planted by Month and Year'
).interactive()
scatter_plot
The frequency plot doesn't show any trends with the other missing data. However, it's interesting to observe the scatterplot showing low data inputs for the June, July, August, and September months. This "anomoly" may make sense knowing trees are typically planted in the fall or spring season.
# Filter the dataset to include only rows where date_planted is NaN
vancity_trees_nan = vancity_trees[vancity_trees['date_planted'].isna()]
# Create the bar chart
bar_chart_nan = alt.Chart(vancity_trees_nan).mark_bar().encode(
x=alt.X('genus_name', sort='-y', title='Tree Species'),
y=alt.Y('count()', title='Number of Trees')
).properties(
width=800,
height=400,
title='Number of Trees by Species with Unknown Planting Date'
).configure_axis(
labelAngle=-45 # Rotate x-axis labels for better readability
)
bar_chart_nan
bar_chart_nan_neighbourhood = alt.Chart(vancity_trees_nan).mark_bar().encode(
x=alt.X('neighbourhood_name', sort='-y', title='Neighborhood'),
y=alt.Y('count()', title='Number of Trees')
).properties(
width=800,
height=400,
title='Number of Trees by Neighborhood with Unknown Planting Date'
).configure_axis(
labelAngle=-45 # Rotate x-axis labels for better readability
)
bar_chart_nan_neighbourhood
It is observed Prunus and Acer two two most common species for trees with missing dates. We will see later on if this is an anomoly when compared to all the Vancouver trees.
neighbourhood_tree_counts = vancity_trees.groupby('neighbourhood_name').size().reset_index(name='tree_count')
neighbourhood_tree_counts_sorted = neighbourhood_tree_counts.sort_values(by='tree_count', ascending=False)
neighbourhood_tree_counts_sorted = neighbourhood_tree_counts_sorted.reset_index(drop=True)
neighbourhood_tree_counts_sorted
| neighbourhood_name | tree_count | |
|---|---|---|
| 0 | Renfrew-Collingwood | 384 |
| 1 | Kensington-Cedar Cottage | 375 |
| 2 | Hastings-Sunrise | 340 |
| 3 | Dunbar-Southlands | 313 |
| 4 | Sunset | 309 |
| 5 | Victoria-Fraserview | 292 |
| 6 | Kitsilano | 269 |
| 7 | Kerrisdale | 252 |
| 8 | Riley Park | 246 |
| 9 | Marpole | 236 |
| 10 | Shaughnessy | 226 |
| 11 | Killarney | 200 |
| 12 | Mount Pleasant | 199 |
| 13 | Grandview-Woodland | 193 |
| 14 | Oakridge | 190 |
| 15 | West Point Grey | 176 |
| 16 | Arbutus-Ridge | 175 |
| 17 | Downtown | 171 |
| 18 | Fairview | 134 |
| 19 | West End | 127 |
| 20 | South Cambie | 118 |
| 21 | Strathcona | 75 |
bar_chart_neighbourhood = alt.Chart(neighbourhood_tree_counts_sorted).mark_bar().encode(
x=alt.X('neighbourhood_name', sort='-y', title='Neighborhood'),
y=alt.Y('tree_count', title='Number of Trees')
).properties(
width=800,
height=400,
title='Number of Trees by Neighborhood with Unknown Planting Date'
).configure_axis(
labelAngle=-45 # Rotate x-axis labels for better readability
)
bar_chart_neighbourhood
We would want to overlay this chart with the "Number of Trees by Neighborhood with Unknown Planting Date" chart to look for anomolies with the missing data.
top_10_species = trees_df["species_name"].value_counts()[:10]
top_10_species = top_10_species.sort_values(ascending=False).reset_index()
top_10_species
| index | species_name | |
|---|---|---|
| 0 | SERRULATA | 463 |
| 1 | PLATANOIDES | 444 |
| 2 | CERASIFERA | 396 |
| 3 | RUBRUM | 261 |
| 4 | AMERICANA | 182 |
| 5 | SYLVATICA | 178 |
| 6 | BETULUS | 170 |
| 7 | EUCHLORA X | 152 |
| 8 | FREEMANI X | 127 |
| 9 | CAMPESTRE | 124 |
top_10_trees = trees_df["common_name"].value_counts()[:10]
top_10_trees = top_10_trees.sort_values(ascending=False).reset_index()
top_10_trees
| index | common_name | |
|---|---|---|
| 0 | KWANZAN FLOWERING CHERRY | 383 |
| 1 | PISSARD PLUM | 295 |
| 2 | NORWAY MAPLE | 215 |
| 3 | CRIMEAN LINDEN | 152 |
| 4 | PYRAMIDAL EUROPEAN HORNBEAM | 100 |
| 5 | NIGHT PURPLE LEAF PLUM | 97 |
| 6 | KOBUS MAGNOLIA | 93 |
| 7 | AKEBONO FLOWERING CHERRY | 88 |
| 8 | RED MAPLE | 81 |
| 9 | KATSURA TREE | 78 |
The most common tree species in Vancouver is the Serrulata, also known as Japanese cherry. The most common name tree is the Kwanzan Flowering Cherry.
# Group by neighbourhood_name and find the maximum height of the trees in each neighborhood
neighbourhood_tallest_trees = vancity_trees.groupby('neighbourhood_name')['height_range_id'].max().reset_index(name='max_height')
# Sort the dataframe by max_height in descending order to find the neighborhood with the tallest trees
neighbourhood_tallest_trees_sorted = neighbourhood_tallest_trees.sort_values(by='max_height', ascending=False)
# Reset the index to count from 1
neighbourhood_tallest_trees_sorted = neighbourhood_tallest_trees_sorted.reset_index(drop=True)
neighbourhood_tallest_trees_sorted.index = neighbourhood_tallest_trees_sorted.index + 1
# Display the sorted dataframe
neighbourhood_tallest_trees_sorted
| neighbourhood_name | max_height | |
|---|---|---|
| 1 | West Point Grey | 9 |
| 2 | Grandview-Woodland | 9 |
| 3 | Kensington-Cedar Cottage | 9 |
| 4 | Shaughnessy | 9 |
| 5 | Kitsilano | 9 |
| 6 | Renfrew-Collingwood | 9 |
| 7 | Oakridge | 8 |
| 8 | West End | 8 |
| 9 | Sunset | 8 |
| 10 | South Cambie | 8 |
| 11 | Riley Park | 8 |
| 12 | Arbutus-Ridge | 8 |
| 13 | Kerrisdale | 8 |
| 14 | Fairview | 8 |
| 15 | Dunbar-Southlands | 8 |
| 16 | Mount Pleasant | 8 |
| 17 | Downtown | 7 |
| 18 | Hastings-Sunrise | 7 |
| 19 | Strathcona | 7 |
| 20 | Victoria-Fraserview | 7 |
| 21 | Marpole | 6 |
| 22 | Killarney | 6 |
This shows the tallest tree height for each neighbourhood. West Point Gret, Grandiew-Woodland, Kensington-Cedar Cottage, Shaughnessy, Kitsilano, and Renfrew Collingwood have the tallest trees in Vancouver. Since cherry blossom trees are the most common tree in Vancouver, we will focus our analysis based on just the Japanese Cherry blossom trees.
cherry_trees = vancity_trees[vancity_trees['species_name'] == 'SERRULATA']
# Group by neighbourhood_name and count the number of cherry blossom trees in each neighborhood
neighbourhood_cherry_counts = cherry_trees.groupby('neighbourhood_name').size().reset_index(name='cherry_tree_count')
# Sort the dataframe by cherry_tree_count in descending order
neighbourhood_cherry_counts_sorted = neighbourhood_cherry_counts.sort_values(by='cherry_tree_count', ascending=False)
neighbourhood_cherry_counts_sorted = neighbourhood_cherry_counts_sorted.reset_index(drop=True)
neighbourhood_cherry_counts_sorted
| neighbourhood_name | cherry_tree_count | |
|---|---|---|
| 0 | Mount Pleasant | 44 |
| 1 | Dunbar-Southlands | 35 |
| 2 | Victoria-Fraserview | 35 |
| 3 | Renfrew-Collingwood | 35 |
| 4 | Kensington-Cedar Cottage | 30 |
| 5 | Marpole | 29 |
| 6 | Sunset | 26 |
| 7 | Riley Park | 24 |
| 8 | Oakridge | 23 |
| 9 | Kerrisdale | 22 |
| 10 | South Cambie | 20 |
| 11 | Shaughnessy | 20 |
| 12 | Grandview-Woodland | 19 |
| 13 | West End | 16 |
| 14 | Arbutus-Ridge | 16 |
| 15 | Kitsilano | 16 |
| 16 | Killarney | 16 |
| 17 | Hastings-Sunrise | 11 |
| 18 | Fairview | 10 |
| 19 | West Point Grey | 7 |
| 20 | Strathcona | 5 |
| 21 | Downtown | 4 |
The Mount Pleasant neighbourhood has the most Japanese Cherry trees.
For all trees:
tree_size_scatter = alt.Chart(trees_df).mark_circle().encode(
alt.X("diameter"),
alt.Y("height_range_id")
)
tree_size_line = alt.Chart(trees_df).mark_line(color = 'Red').encode(
alt.X("mean(diameter)"),
alt.Y("height_range_id")
)
tree_size_scatter + tree_size_line
There is a clear positive trend between the diameter of the tree and its height.
For cherry trees:
# Distribution of cherry tree height and diameter
alt.Chart(cherry_trees).mark_tick().encode(
alt.X("diameter"),
alt.Y("height_range_id")
)
Cherry trees also show a clear positive trend between the diameter of the tree and its height, as expected.
# Density plot
alt.Chart(cherry_trees).transform_density("diameter", groupby=["height_range_id"], as_=["diameter", "density"]).mark_area().encode(
x="diameter",
y="density:Q",
color="height_range_id:N",
opacity=alt.value(0.9)
)
Here we can that most shorter cherry trees (height range = 1) have a diameter of 4. We cannot draw a conclusion for cherry trees with height 4, because it doesn't have enough data as the diameter is cut at 20 and 33.
heatmap = alt.Chart(vancity_trees).mark_rect().encode(
x=alt.X('longitude:Q', bin=alt.Bin(maxbins=90), title='Longitude'),
y=alt.Y('latitude:Q', bin=alt.Bin(maxbins=90), title='Latitude'),
color=alt.Color('count():Q')
).properties(
width=550,
height=400
)
heatmap
heatmap = alt.Chart(vancity_trees).mark_rect().encode(
x=alt.X('longitude:Q', bin=alt.Bin(maxbins=35), title='Longitude'),
y=alt.Y('latitude:Q', bin=alt.Bin(maxbins=35), title='Latitude'),
color=alt.Color('count():Q')
).properties(
width=550,
height=400
)
heatmap
These heap maps shows us where the trees are bunched up in Vancouver. It shows how the trees are geographically distributed by creating a tree density map.
scatter_plot = alt.Chart(vancity_trees).mark_circle(size=60, opacity=0.5).encode(
x=alt.X('longitude:Q', title='Longitude', scale=alt.Scale(zero=False)),
y=alt.Y('latitude:Q', title='Latitude', scale=alt.Scale(zero=False)),
color=alt.Color('height_range_id:Q', title='Tree Height'),
tooltip=['species_name', 'neighbourhood_name', 'diameter']
).properties(
width=550,
height=400,
title='Scatter Plot of Trees in Vancouver by Longitude and Latitude'
)
scatter_plot
This scatter plot shows a "map" of Vancouver. It can better show us the patches of area without trees, and also shows how the trees are geographically distributed in height.
scatter_plot = alt.Chart(vancity_trees).mark_circle(size=60, opacity=0.5).encode(
x=alt.X('longitude:Q', title='Longitude', scale=alt.Scale(zero=False)),
y=alt.Y('latitude:Q', title='Latitude', scale=alt.Scale(zero=False)),
color=alt.Color('diameter:Q', title='Tree Diameter'),
tooltip=['species_name', 'neighbourhood_name', 'diameter']
).properties(
width=550,
height=400,
title='Scatter Plot of Trees in Vancouver by Longitude and Latitude'
)
scatter_plot
This scatter plot shows how the trees are geographically distributed in diameter.
The dataset has too many NaN values for date_planted, so this column of data is unreliable to calculate Vancouver distribution of tree age. However, we're able to see a clear range for the tree's diameter and height range.
According to the dataset schema, the height range is measured in feet. 0-10 for every 10 feet (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft).
Diameter of tree is measured as DBH in inches (DBH stands for diameter of tree at breast height).
numerical_columns = trees_df.select_dtypes('number').columns.tolist()
chart = alt.Chart(trees_df).mark_bar().encode(
alt.X(alt.repeat(), type='quantitative', bin=alt.Bin(maxbins=25)),
alt.Y('count()')
).properties(width=250, height=150).repeat(numerical_columns, columns=2)
chart
This shows the distribution of trees' diameter and height, as well as the location (latitude and longitute) distribution of the trees.
# Get the categorical columns from trees_df
categorical_columns = trees_df.select_dtypes("object").columns.tolist()
# Drop the columns that does not need to be visualized
categorical_columns.remove('on_street')
categorical_columns.remove('date_planted')
# Get the dataframe of the top_10_trees
top_10_trees_df = trees_df[trees_df["common_name"].isin(top_10_trees["index"])]
# Drop the columns that does not need to be visualized
top_10_trees_df = top_10_trees_df.drop(columns=['on_street', 'date_planted'])
# Plot the chart
chart = alt.Chart(top_10_trees_df).mark_bar().encode(
alt.X('count()'),
alt.Y(alt.repeat(), type='nominal', sort='x')
).properties(width=250, height=250).repeat(categorical_columns, columns=2)
chart
The most popular species is Cerasifera. The neighbourhood with the most trees is Renfrew-Collingwood. The most popular genus is Prunus. The most popular tree is the Kawanzan Flowering Cherry. Most trees do not have a root barrier.
# Drop the numerical columns 'longitude' and 'latitude'
numerical_columns.remove('longitude')
numerical_columns.remove('latitude')
diameter_order = []
for groupby_col in categorical_columns:
grouped_df = top_10_trees_df.groupby(groupby_col)
# Calculate the median diameter for each group
median_diameter = grouped_df['diameter'].median()
sorted_groups = median_diameter.sort_values()
sorted_list = sorted_groups.index.to_list()
# Extend the main list with the ordered group names
diameter_order.extend(sorted_list)
alt.Chart(top_10_trees_df).mark_boxplot().encode(
alt.X(alt.repeat("column"), type="quantitative"),
alt.Y(alt.repeat("row"), type="nominal", sort=diameter_order),
).properties(
width=350, height=350
).repeat(
column=numerical_columns, row=categorical_columns[1:4]
)
Explorating this relationship between the numerical and categorical data gives us interesting results. The genus Acer shows significant variability in both diameter and height, suggesting that this genus includes both small and large species. Norway Maple and Kwanzan Flowering Cherry have a broader range in diameter and height, which could indicate these species are popular and planted in various conditions.
I will be including 4 graphs in my Vancouver Tree analysis project. The graphs will be redone with suitable design choices for easy visualization for the audience. I would want to use the actual Vancouver tree dataset instead of the sample 5000 entries dataset. I want to show the accurate results from the real data.
The plot for question 2 will be visualized as a bar chart. Although it's simple, this is to give a general overview of the 10 most popular trees in Vancouver.
The plot for question 4 can include a bar graph, and include all types of Japanese cherry against the neighbourhoods. It will also include a legend showing which colour correspond to the type of tree.
Question 5 plot would stay more or less the same, but also include the ranges for specifically Japanese cherry trees. I will be including a title, and perhaps a different colour scheme.
For Question 6, I would want to create a visualization on the map of Vancouver visualizing the neighbourhood areas. This map will visualize a density plot using headmap and a tree height distribution using scatter plot. A different colour scheme would be used.
The audience will not need to see graphs with too many analytical values such as mean, medium, and boxplots because the report is meant for the general public to see and understand easily.